
    The Neural Representation Benchmark and its Evaluation on Brain and Machine

    A key requirement for the development of effective learning representations is their evaluation and comparison to representations we know to be effective. In natural sensory domains, the community has viewed the brain as a source of inspiration and as an implicit benchmark for success. However, it has not been possible to test representational learning algorithms directly against the representations contained in neural systems. Here, we propose a new benchmark for visual representations on which we have directly tested the neural representation in multiple visual cortical areas in macaque (utilizing data from [Majaj et al., 2012]), and on which any computer vision algorithm that produces a feature space can be tested. The benchmark measures the effectiveness of the neural or machine representation by computing the classification loss on the ordered eigendecomposition of a kernel matrix [Montavon et al., 2011]. In our analysis we find that the neural representation in visual area IT is superior to that in visual area V4. In our analysis of representational learning algorithms, we find that three-layer models approach the representational performance of V4 and that the algorithm in [Le et al., 2012] surpasses the performance of V4. Impressively, we find that a recent supervised algorithm [Krizhevsky et al., 2012] achieves performance comparable to that of IT for an intermediate level of image variation difficulty, and surpasses IT at a higher difficulty level. We believe this result represents a major milestone: it is the first learning algorithm we have found that exceeds our current estimate of IT representation performance. We hope that this benchmark will assist the community in matching the representational performance of visual cortex and will serve as an initial rallying point for further correspondence between representations derived in brains and machines. Comment: The v1 version contained incorrectly computed kernel analysis curves and KA-AUC values for V4, IT, and the HT-L3 models. They have been corrected in this version.
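The benchmark's core computation is easy to sketch. Below is a minimal, toy reconstruction of the kernel-analysis idea from [Montavon et al., 2011] as described above: eigendecompose a kernel matrix built from the representation, then measure classification loss as a function of how many leading kernel components the classifier may use. A linear kernel and least-squares classifier are our own simplifying assumptions, not the benchmark's released code.

```python
import numpy as np

def kernel_analysis_curve(X, y, ranks):
    """Toy kernel analysis: classification error on the ordered
    eigendecomposition of a (linear) kernel matrix, as a function of
    representational complexity (number of leading components)."""
    Xc = X - X.mean(axis=0)                # center the features
    K = Xc @ Xc.T                          # linear kernel matrix
    eigvals, eigvecs = np.linalg.eigh(K)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Y = np.eye(y.max() + 1)[y]             # one-hot labels
    errors = []
    for d in ranks:
        # Representation restricted to the top-d kernel components.
        Z = eigvecs[:, :d] * np.sqrt(np.maximum(eigvals[:d], 0.0))
        W, *_ = np.linalg.lstsq(Z, Y, rcond=None)  # least-squares fit
        errors.append(np.mean((Z @ W).argmax(axis=1) != y))
    return errors
```

On this measure, a representation is better when its error curve falls faster, i.e. fewer kernel components suffice for accurate classification; summarizing the curve by its area gives a KA-AUC-style score like the one mentioned in the comment above.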

    Spatial-frequency channels, shape bias, and adversarial robustness

    What spatial frequency information do humans and neural networks use to recognize objects? In neuroscience, critical band masking is an established tool that can reveal the frequency-selective filters used for object recognition. Critical band masking measures the sensitivity of recognition performance to noise added at each spatial frequency. Existing critical band masking studies show that humans recognize periodic patterns (gratings) and letters by means of a spatial-frequency filter (or "channel") that has a frequency bandwidth of one octave (a doubling of frequency). Here, we introduce critical band masking as a task for network-human comparison and test 14 humans and 76 neural networks on 16-way ImageNet categorization in the presence of narrowband noise. We find that humans recognize objects in natural images using the same one-octave-wide channel that they use for letters and gratings, making it a canonical feature of human object recognition. On the other hand, the neural network channel, across various architectures and training strategies, is 2-4 times as wide as the human channel. In other words, networks are vulnerable to high- and low-frequency noise that does not affect human performance. Adversarial and augmented-image training are commonly used to increase network robustness and shape bias. Does this training align network and human object recognition channels? Three network channel properties (bandwidth, center frequency, peak noise sensitivity) correlate strongly with shape bias (53% variance explained) and with the robustness of adversarially trained networks (74% variance explained). Adversarial training increases robustness but expands the channel bandwidth even further away from the human bandwidth. Thus, critical band masking reveals that the network channel is more than twice as wide as the human channel, and that adversarial training only increases this difference. Comment: Accepted to Neural Information Processing Systems (NeurIPS) 2023 (Oral Presentation).
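The key manipulation, adding noise restricted to a narrow spatial-frequency band, is straightforward to reproduce. The sketch below is our own illustration (parameter names and the one-octave default are assumptions consistent with the abstract): bandpass-filter white noise in the Fourier domain, keeping a log-symmetric band around a chosen center frequency.

```python
import numpy as np

def narrowband_noise(size, center_freq, octaves=1.0, rms=0.1, seed=None):
    """White noise restricted to a band of the given width (in octaves)
    around center_freq (in cycles/image), for critical band masking."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.fftshift(np.fft.fft2(rng.standard_normal((size, size))))
    coords = np.arange(size) - size // 2
    fy, fx = np.meshgrid(coords, coords, indexing="ij")
    radius = np.hypot(fx, fy)                      # radial frequency
    lo = center_freq * 2.0 ** (-octaves / 2)
    hi = center_freq * 2.0 ** (octaves / 2)
    spectrum *= (radius >= lo) & (radius <= hi)    # keep one band only
    band = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))
    return band * (rms / band.std())               # set the noise contrast
```

Sweeping center_freq and measuring where recognition accuracy drops traces out an observer's channel; its bandwidth, center frequency, and peak noise sensitivity are the three channel properties analyzed above.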

    Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition

    The primate visual system achieves remarkable visual object recognition performance even in brief presentations and under changes to object exemplar, geometric transformations, and background variation (a.k.a. core visual object recognition). This remarkable performance is mediated by the representation formed in inferior temporal (IT) cortex. In parallel, recent advances in machine learning have led to ever higher performing models of object recognition using artificial deep neural networks (DNNs). It remains unclear, however, whether the representational performance of DNNs rivals that of the brain. To accurately produce such a comparison, a major difficulty has been a unifying metric that accounts for experimental limitations, such as the amount of noise, the number of neural recording sites, and the number of trials, and computational limitations, such as the complexity of the decoding classifier and the number of classifier training examples. In this work we perform a direct comparison that corrects for these experimental limitations and computational considerations. As part of our methodology, we propose an extension of "kernel analysis" that measures the generalization accuracy as a function of representational complexity. Our evaluations show that, unlike previous bio-inspired models, the latest DNNs rival the representational performance of IT cortex on this visual object recognition task. Furthermore, we show that models that perform well on measures of representational performance also perform well on measures of representational similarity to IT and on measures of predicting individual IT multi-unit responses. Whether these DNNs rely on computational mechanisms similar to the primate visual system is yet to be determined, but, unlike all previous bio-inspired models, that possibility cannot be ruled out merely on representational performance grounds. Comment: 35 pages, 12 figures; extends and expands upon arXiv:1301.3530.
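The representational-similarity measure mentioned above is commonly computed by correlating representational dissimilarity matrices (RDMs) between a model and IT over the same image set. The snippet below is a generic RSA sketch, not the paper's exact pipeline; it assumes an images-by-features matrix for the model and an images-by-sites matrix of neural responses.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm_similarity(model_features, it_responses):
    """Correlate the model's and IT's representational dissimilarity
    matrices (1 - Pearson r between rows, a common RDM choice)."""
    rdm_model = pdist(model_features, metric="correlation")
    rdm_it = pdist(it_responses, metric="correlation")
    rho, _ = spearmanr(rdm_model, rdm_it)  # rank correlation of RDMs
    return rho
```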

    Efficient Coding of Local 2D Shape

    Efficient coding provides a concise account of key early visual properties, but can it explain higher-level visual function such as shape perception? If curvature is a key primitive of local shape representation, efficient shape coding predicts that the sensitivity of visual neurons should be determined by naturally-occurring curvature statistics, which follow a scale-invariant power-law distribution. To assess visual sensitivity to these power-law statistics, we developed a novel family of synthetic maximum-entropy shape stimuli that progressively match the local curvature statistics of natural shapes but lack global structure. We find that humans can reliably identify natural shapes based on 4th and higher-order moments of the curvature distribution, demonstrating fine sensitivity to these naturally-occurring statistics. What is the physiological basis for this sensitivity? Many V4 neurons are selective for curvature, and analysis of population responses suggests that neural population sensitivity is optimized to maximize the information rate for natural shapes. Further, we find that the average neural response in the foveal confluence of early visual cortex increases as object curvature converges to the naturally-occurring distribution, reflecting an increased upper bound on information rate. Reducing the variance of the curvature distribution of synthetic shapes to match the variance of the naturally-occurring distribution impairs the linear decoding of individual shapes, presumably due to the reduction in stimulus entropy. However, matching higher-order moments improves decoding performance, despite further reducing stimulus entropy. Collectively, these results suggest that efficient coding can account for many aspects of curvature perception.
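The curvature statistics at issue can be estimated from any closed contour. The sketch below is a generic discrete-curvature estimator (our own, not the paper's exact definition): signed turning angle per unit arc length at each vertex, from which the distribution's higher-order moments follow.

```python
import numpy as np

def contour_curvature(points):
    """Discrete curvature along a closed 2D contour given as an (N, 2)
    array of vertices: signed turning angle divided by local arc length."""
    prev_pt = np.roll(points, 1, axis=0)
    next_pt = np.roll(points, -1, axis=0)
    v_in, v_out = points - prev_pt, next_pt - points
    # Turning angle, wrapped to (-pi, pi].
    turn = np.angle(np.exp(1j * (np.arctan2(v_out[:, 1], v_out[:, 0])
                                 - np.arctan2(v_in[:, 1], v_in[:, 0]))))
    ds = 0.5 * (np.linalg.norm(v_in, axis=1) + np.linalg.norm(v_out, axis=1))
    return turn / ds

# The moments observers were sensitive to (4th and higher):
# k = contour_curvature(shape)
# moments = [np.mean(k ** m) for m in range(1, 7)]
```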

    Grouping in object recognition: The role of a Gestalt law in letter identification

    The Gestalt psychologists reported a set of laws describing how vision groups elements to recognize objects. The Gestalt laws “prescribe for us what we are to recognize ‘as one thing’” (Köhler, 1920). Were they right? Does object recognition involve grouping? Tests of the laws of grouping have been favourable, but mostly assessed only detection, not identification, of the compound object. The grouping of elements seen in the detection experiments with lattices and “snakes in the grass” is compelling, but falls far short of the vivid everyday experience of recognizing a familiar, meaningful, named thing, which mediates the ordinary identification of an object. Thus, after nearly a century, there is hardly any evidence that grouping plays a role in ordinary object recognition. To assess grouping in object recognition, we made letters out of grating patches and measured threshold contrast for identifying these letters in visual noise as a function of perturbation of grating orientation, phase, and offset. We define a new measure, “wiggle”, to characterize the degree to which these various perturbations violate the Gestalt law of good continuation. We find that efficiency for letter identification is inversely proportional to wiggle and is wholly determined by wiggle, independent of how the wiggle was produced. Thus the effects of three different kinds of shape perturbation on letter identifiability are predicted by a single measure of goodness of continuation. This shows that letter identification obeys the Gestalt law of good continuation and may be the first confirmation of the original Gestalt claim that object recognition involves grouping.
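The quantitative claim has a compact form. Identification efficiency is conventionally the ratio of ideal-observer to human threshold energy, and the finding above says this efficiency is inversely proportional to wiggle. A minimal sketch of that relation, with an illustrative fitted constant k of our own (the paper does not give one here):

```python
def efficiency(ideal_threshold_energy, human_threshold_energy):
    """Conventional identification efficiency: stimulus energy an ideal
    observer needs at threshold, relative to the human observer."""
    return ideal_threshold_energy / human_threshold_energy

def predicted_efficiency(wiggle, k=1.0):
    """The paper's finding as a predictive rule: efficiency is inversely
    proportional to wiggle (k is a fitted constant, illustrative here)."""
    return k / wiggle
```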

    Deep learning—Using machine learning to study biological vision


    Are faces processed like words? A diagnostic test for recognition by parts

    Do we identify an object as a whole or by its parts? This simple question has been surprisingly hard to answer. It has been suggested that faces are recognized as wholes and words are recognized by parts. Here we answer the question by applying a test for crowding. In crowding, a target is harder to identify in the presence of nearby flankers. Previous work has described crowding between objects. We show that crowding also occurs between the parts of an object. Such internal crowding severely impairs perception, identification, and fMRI face-area activation. We apply a diagnostic test for crowding to a word and a face, and we find that the critical spacing of the parts required for recognition is proportional to distance from fixation and independent of size and kind. The critical spacing defines an isolation field around the target. Some objects can be recognized only when each part is isolated from the rest of the object by the critical spacing. In that case, recognition is by parts. Recognition is holistic if the observer can recognize the object even when the whole object fits within a critical spacing. Such an object has only one part. Multiple parts within an isolation field will crowd each other and spoil recognition. To assess the robustness of the crowding test, we manipulated familiarity through inversion and the face- and word-superiority effects. We find that threshold contrast for word and face identification is the product of two factors: familiarity and crowding. Familiarity increases sensitivity by a factor of 1.5, independent of eccentricity, while crowding attenuates sensitivity more and more as eccentricity increases. Our findings show that observers process words and faces in much the same way: the effects of familiarity and crowding do not distinguish between them.
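The diagnostic test lends itself to a compact statement. The sketch below is our own encoding of the two claims above; the Bouma fraction of ~0.5 is the conventional crowding value (not given in this abstract), and the crowding slope is purely illustrative.

```python
def is_holistic(object_size_deg, eccentricity_deg, bouma_fraction=0.5):
    """Recognition is holistic if the whole object fits within one
    critical spacing, which grows with eccentricity and is independent
    of object size (the abstract's diagnostic criterion)."""
    critical_spacing = bouma_fraction * eccentricity_deg
    return object_size_deg <= critical_spacing

def threshold_contrast(base, eccentricity_deg, familiar,
                       familiarity_gain=1.5, crowding_slope=0.3):
    """Threshold contrast as the product of two factors: familiarity
    (x1.5 sensitivity, eccentricity-independent) and crowding
    (attenuation growing with eccentricity; slope is illustrative)."""
    crowding_loss = 1.0 + crowding_slope * eccentricity_deg
    return base * crowding_loss / (familiarity_gain if familiar else 1.0)
```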